Investigating Common-Item Screening Procedures in Developing a Vertical Scale

Authors

  • Marc Johnson
  • Qing Yi
Abstract

Creating a vertical scale involves several decisions about assessment design and statistical analysis to determine the most appropriate scale. This study investigates common-item stability check procedures used to arrive at vertical linking item sets, which in turn produce the constants needed to compute vertical theta (ability) estimates and scale scores on the vertical scale metric. The research reported in this paper investigates the phenomenon of common items (across adjacent levels) having lower difficulty estimates ("easier") at the lower level than at the upper level, and the vertical scales that result. A major finding of this research is that the presence of linking items that appear easier at the lower level than at the upper level can still lead to patterns of increasing achievement growth from the lowest level of the scale to the highest.

Introduction

Vertical scaling is the process of placing scores from tests that measure similar constructs, but at different educational levels, onto a common scale (Kolen & Brennan, 2004). Vertical scales are therefore thought of as a progression of scale scores used to monitor academic achievement across age/grade levels (hereafter, levels). The need for vertical scales has received much attention in the past decade because No Child Left Behind (NCLB) requires that assessment programs track academic progress. Despite the prevalence of vertical scales in national and state assessment programs, the methodologies used to derive them are numerous and often produce different results.
In deriving vertical scales, practitioners must choose among scaling methodologies (e.g., item response theory (IRT), Thurstone scaling), vertical linking strategies across levels (e.g., concurrent, separate level-groups, level-by-level), and scaling designs (e.g., scaling test, common items across levels, equivalent groups). Other factors should also be considered when designing a vertical scale, and studies have analyzed how various combinations of these factors affect the resulting scales (Ito, Sykes, & Yao, 2008; Tong & Kolen, 2007). These studies have not provided clear guidance on which combination of factors produces the "best" vertical scale. Instead, practitioners of vertical scales, or those interested in designing them, typically derive appropriate scales by analyzing how combinations of these factors affect the scales in relation to the expectation of growth within individual assessment programs.

One factor that deserves more attention in vertical scaling is the set of items ultimately used to create the vertical link among levels. In other words, regardless of the scaling design, vertical scales are created via a set of items responded to by examinees at different levels. In the common-item approach, vertical linking items are assessed within on-level test forms as well as within off-level test forms. In the equivalent groups design, examinees can be randomly assigned to respond to either an on-level test or an off-level test. With a scaling test design, however, examinees respond to a "test" that consists of all vertical linking items across all levels; the scaling test is administered in addition to an on-level test, from which scores are linked to the scaling test.
In practice, examinee performance on the vertical linking items is compared between off-level and on-level examinees. This comparison can result in items being removed from the vertical linking item set prior to the construction of the vertical scale, analogous to common-item screening in horizontal equating. The screening methodologies used in vertical linking studies can be the same procedures found in horizontal equating (e.g., Robust Z analysis, perpendicular distance). However, the assumptions about item instability differ in the vertical linking context from those of conventional horizontal equating practice. In vertical linking, the linking items are expected to show a performance differential between on-level and off-level examinees, whereas that expectation is irrelevant in horizontal equating studies. This raises the question of whether common-item screening methodologies developed for horizontal equating are appropriate in vertical linking contexts: should items be removed at all when a performance differential between on-level and off-level examinees exists?

The research interest expressed in this paper involves examining common-item screening methodologies for vertical linking items and the impact of removal decisions on vertical scales. In other words, this study investigates different procedures for adjusting vertical linking item sets and how those decisions affect the resulting scales. As stated above, some performance differential is expected in vertical linking studies; this study examines varying degrees of that expectation and the decisions that can be justified based on the empirical differential in item performance.
Linking Items in Equating

In practice, horizontal equating (statistically placing a test form onto a particular measurement scale) is often accomplished through a set of items designated as linking items. When a test form is being placed onto the measurement scale of another test form, the linking items are those common to both forms. When a test form is being placed onto the measurement scale of an item pool, however, the linking items can be either all scored test items or a subset of them. In either situation, a measurement link is established that allows a test form to be placed onto the same scale as a previous form or the item pool.

The selection of linking items, when they represent only a subset of the tested items, has been considered critical to the design of horizontal equating studies, and guidelines have been established that continue to be used in the psychometric analyses of large-scale assessment programs. These guidelines address content representation relative to the entire test form, the position of the linking items throughout the test, the number of linking items relative to the total number of items on the test, and the statistical properties of the intended linking items, usually based on past performance. Although important, a detailed dissection of these guidelines is beyond the scope of this study; readers are referred to texts that discuss them in more detail (Klein & Jarjoura, 1985; Wingersky, Cook, & Eignor, 1987).

Vertical linking can be accomplished through a variety of methods. One method is the use of linking items, analogous to horizontal equating. When used as common items across adjacent levels, vertical linking item sets mostly consist of items that students at both adjacent levels can respond to correctly.
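The guidelines above can be illustrated with a minimal sketch that checks two of them for a candidate linking set: that the set is a sizeable fraction of the test, and that its content-area mix mirrors the full form. The dictionary layout, the 20% length threshold, and the function name are assumptions for illustration, not part of any published guideline's exact wording.

```python
from collections import Counter

def check_linking_set(test_items, linking_items, min_fraction=0.2):
    """Illustrative check of two linking-item guidelines:
    length of the linking set relative to the test, and
    content-area representation relative to the full form."""
    frac = len(linking_items) / len(test_items)
    # Proportion of each content area on the full form ...
    full = Counter(item["content_area"] for item in test_items)
    full_p = {c: n / len(test_items) for c, n in full.items()}
    # ... versus in the linking set
    link = Counter(item["content_area"] for item in linking_items)
    link_p = {c: link.get(c, 0) / len(linking_items) for c in full}
    max_gap = max(abs(full_p[c] - link_p[c]) for c in full)
    return {"length_ok": frac >= min_fraction,
            "max_content_gap": round(max_gap, 3)}

# A 10-item form with a 2-item linking set (20% of the test)
test_items = ([{"content_area": "algebra"}] * 6
              + [{"content_area": "geometry"}] * 4)
linking_items = [{"content_area": "algebra"},
                 {"content_area": "geometry"}]
print(check_linking_set(test_items, linking_items))
```

An operational check would also cover item position and statistical properties; this sketch only flags how far the linking set's content mix drifts from the full form's.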
The linking-item guidelines of horizontal equating, mentioned above, are applicable in the vertical linking context so that a strong measurement link can be established that fosters a reasonable scale of growth across all levels. The scaling test method of vertical linking, however, relies on examinees responding to an on-level test as well as a test that consists of items spanning all levels (the scaling test; Kolen & Brennan, 2004).

Linking Item Performance in Equating – Stability Check Procedures

When linking items are used to establish a measurement link between test forms, or between a test form and an item pool, the item statistics are analyzed by comparing previously obtained statistics with newly obtained ones. Under the Rasch model, the IRT statistics can be compared through procedures such as the Robust Z analysis (Huynh, Gleaton, & Seaman, 1992), the perpendicular distance method mentioned earlier, and the 0.3-logit difference procedure (Miller, Rotou, & Twing, 2004). All of these procedures (discussed below), referred to as item stability checks, aim to identify items that show a greater-than-expected difference between the old and new statistics, each with its own criterion of acceptable difference. In practice, items identified at this stage are considered for removal from the linking item set before the final measurement link is established and raw scores are scaled to scale scores. Each procedure, however, carries its own guidelines for how common items are removed from the linking item set.
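Two of the stability checks named above can be sketched on Rasch difficulty estimates. The Robust Z statistic divides each item's old-versus-new difficulty difference, centered at the median, by 0.74 times the interquartile range of the differences; the 0.3-logit procedure centers the new estimates on the old mean and flags items that shift by more than 0.3 logits. The critical value of 1.645 and the guard-free handling of ties are assumptions for this sketch, not prescriptions from the cited sources.

```python
import statistics

def robust_z(old_b, new_b, critical=1.645):
    """Robust Z check on Rasch difficulty differences
    (after Huynh, Gleaton, & Seaman, 1992). Assumes a
    nonzero interquartile range among the differences."""
    diffs = [n - o for o, n in zip(old_b, new_b)]
    med = statistics.median(diffs)
    q1, _, q3 = statistics.quantiles(diffs, n=4)
    scale = 0.74 * (q3 - q1)  # robust stand-in for the SD
    return [abs((d - med) / scale) > critical for d in diffs]

def logit_difference(old_b, new_b, threshold=0.3):
    """0.3-logit difference check (after Miller, Rotou, &
    Twing, 2004): center new estimates on the old mean,
    then flag items shifting more than the threshold."""
    shift = statistics.mean(old_b) - statistics.mean(new_b)
    return [abs((n + shift) - o) > threshold
            for o, n in zip(old_b, new_b)]

# Five linking items; the last drifts far more than the rest
old = [-1.0, -0.5, 0.0, 0.5, 1.0]
new = [-0.9, -0.4, 0.1, 0.6, 2.5]
print(robust_z(old, new))         # only the drifting item is flagged
print(logit_difference(old, new))  # same item exceeds 0.3 logits
```

Note that both procedures flag the same outlying item here, but in general their flags can disagree, which is one reason the removal guidelines attached to each procedure matter.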


Similar Articles

Investigating Content and Construct Representation of a Common-item Design When Creating a Vertically Scaled Test

According to the equating guidelines, a set of common items should be a mini version of the total test in terms of content and statistical representation (Kolen & Brennan, 2004). Differences between vertical scaling and equating would suggest that these guidelines may not apply to vertical scaling in the same way that they apply to equating. This study investigated how well the guideline of con...


Investigating the Impact of Response Format on the Performance of Grammar Tests: Selected and Constructed

When constructing a test, an initial decision is choosing an appropriate item response format, which can be classified as selected or constructed. In large-scale tests where time and finance are of concern, the use of the selected-response format known as multiple-choice items is quite widespread. This study aimed at investigating the impact of response format on the performance of structure tests. Concurren...


Evaluation of Cardiac Complications Following Emergency Vascular Surgery, Sina Hospital, 1378–79

Complications of coronary artery disease remain the most common cause of morbidity and mortality after vascular surgical procedures. Goldman risk factor analysis has been suggested as a peri-operative noninvasive screening method to detect significant coronary artery disease in emergent vascular procedures. Methods and Materials: In this study, the accuracy of the Goldman scale was assessed with r...


Developing a Psychometric Scale for Brief Evaluation of Outpatient Satisfaction

Background and Objectives: Patient satisfaction is a key feature of quality improvement in modern health care systems. The focus of patient satisfaction studies has been on inpatient satisfaction measurement. As such, valid and reliable instruments for assessment of outpatient satisfaction are lacking in the field. This study aimed to develop and validate a brief scale to facilitate assessing o...


Growth Scales as an Alternative to Vertical Scales - Practical Assessment, Research & Evaluation

Student growth models depend on comparing assessments of individual students over time. Vertical scales (c.f. Kolen and Brennan, 2004) are among several options that exist for development of scales that allow these comparisons. Briefly, vertical scales are created through administering an embedded subset of items to different students at two educational levels, typically one year apart, and lin...



Journal title:

Volume   Issue

Pages  -

Publication date: 2011